We merged results obtained from the Category B index with results obtained from the index built
over complete (Category A) anchor text. However, we were unable to improve over Category B
results in either the ad hoc or the diversity task.


1.  INTRODUCTION 

Associating anchor text with the pages to which the links point is a well-known
approach to improving retrieval quality. It was used in the first version of Google
[Brin and Page 1998]. On one hand, using the anchor text alone allows one to
obtain a system with decent performance [Anh and Moffat 2010; Hiemstra and
Hauff 2010]. We also know that the anchor text is a strong relevance signal from
our own experiments in TREC 2011 [Boytsov and Belova 2011]. On the other
hand, the size of the anchor text is much smaller than the size of the text for a full
collection. Thus, enriching the Category B index (built over 50M documents) with
the Category A anchor text index (built over 370M short documents) seemed to
be an appealing method of improving performance at little cost.

2.  EXPERIMENTS 

2.1  Setup 

We used two retrieval engines. One was a system developed for TREC 2011, which
included an index for the Category B subset (50M documents). It explicitly indexed
posting lists of close word pairs (where at least one word was frequent) and had a
large index of 513 Gb. A detailed description of this type of index is given in our
2010 and 2011 reports [Boytsov and Belova 2010; 2011]. The system combines more
than 20 relevance features in a semi-linear formula. In TREC 2011, we showed
that this system was a strong benchmark: see the run srchvrsllb (Table 2) in the
overview paper by Clarke et al. [2011].

In addition, we built a similar index over the Category A anchor text, which
was compiled by Hiemstra and Hauff [2010]. Unlike the Category
B index, it employed only one text field: anchor text. The anchor text index relied
on SpamRank, but not on PageRank. The number of documents was about
370M. However, each document was small and the size of the index was only 212 Gb.

2.2  Results 

We tried several approaches to combining scores from the two retrieval systems: a linear
combination of scores with different dictionaries, a linear combination of scores with
a shared dictionary, and a round-robin method. In the approach with the shared
dictionary, we used IDF values only from the Category B dictionary. None of the
approaches allowed us to achieve higher performance scores on training data in the
ad hoc task. However, we obtained slightly higher values of the diversity metric
α-nDCG@20.
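
The round-robin method mentioned above can be illustrated with a minimal sketch. The text does not describe the exact merging procedure, so the alternation order, the duplicate handling, and the function name round_robin_merge are assumptions made for illustration only.

```python
from itertools import zip_longest


def round_robin_merge(ranking_a, ranking_b, k=20):
    """Interleave two ranked lists of document ids, keeping the first
    occurrence of each document (assumed duplicate handling)."""
    merged, seen = [], set()
    for pair in zip_longest(ranking_a, ranking_b):
        for doc_id in pair:
            # zip_longest pads the shorter list with None
            if doc_id is not None and doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged[:k]


# Toy rankings (hypothetical document ids, not TREC data)
category_b = ["d1", "d2", "d3"]
anchor_text = ["d2", "d4", "d5"]
print(round_robin_merge(category_b, anchor_text, k=5))
```

Because round-robin ignores the scores themselves, it gives both systems equal influence on the merged list regardless of how reliable each one is.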

Overall, we submitted three runs: srchvrsl2c10, srchvrsl2c09, and srchvrsl2c00,
where anchor text scores were summed with Category B scores. Prior to
aggregating, anchor text scores were multiplied by the scaling coefficients 1, 0.9,
and 0, respectively. The last run (srchvrsl2c00) represents a “pure” Category B
run.
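
The score aggregation used for the submitted runs can be sketched as follows. This is a simplified illustration rather than the actual system code: the function name combine_scores is hypothetical, and treating a document missing from one system as having a score of 0 in that system is an assumption.

```python
def combine_scores(category_b_scores, anchor_scores, anchor_weight):
    """Rank documents by Category B score plus scaled anchor text score.

    Both inputs map document ids to scores. A document absent from one
    system contributes 0 from that system (an assumption).
    """
    doc_ids = set(category_b_scores) | set(anchor_scores)
    combined = {
        d: category_b_scores.get(d, 0.0) + anchor_weight * anchor_scores.get(d, 0.0)
        for d in doc_ids
    }
    # Sort by descending score; ties are broken by document id
    return sorted(combined, key=lambda d: (-combined[d], d))


# Toy scores (hypothetical, not TREC data)
cat_b = {"d1": 2.0, "d2": 1.5}
anchor = {"d2": 1.0, "d3": 0.4}
for weight in (1.0, 0.9, 0.0):  # the three scaling coefficients
    print(weight, combine_scores(cat_b, anchor, weight))
```

With a coefficient of 0, the anchor text scores no longer affect the ordering of documents returned by the Category B system.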

It turned out that all three runs had almost identical diversity scores ERR-IA@20,
which were approximately equal to 0.38. The ad hoc scores were very similar
as well: for example, ERR@20 was approximately equal to 0.305.
We see that both srchvrsl2c09 and srchvrsl2c10 improved in MAP
over the pure Category B run srchvrsl2c00 (this was a statistically significant
improvement that “survived” the Holm-Bonferroni correction). Yet, these improvements
were small: 2.3% and 5%, respectively.
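
For readers unfamiliar with the Holm-Bonferroni correction, the step-down procedure can be sketched as follows. The p-values in the example are made up for illustration and are not taken from our experiments; the significance level of 0.05 is likewise an assumption.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is
    rejected by the Holm-Bonferroni step-down procedure."""
    m = len(p_values)
    # Visit hypotheses in order of increasing p-value
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for rank, i in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared with alpha / (m - k + 1)
        if p_values[i] <= alpha / (m - rank):
            rejected[i] = True
        else:
            break  # all remaining (larger) p-values are not rejected
    return rejected


# Hypothetical p-values for three comparisons
print(holm_bonferroni([0.01, 0.04, 0.03]))
```

An improvement "survives" the correction if its hypothesis is still rejected after the thresholds are tightened in this way.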

We also compared the performance of runs that relied solely on Category A anchor
text with the performance of Category B runs (same algorithm as for srchvrsl2c00).
According to Table I, in 2010-2011 the values of ERR@20 for anchor text runs
were only slightly higher than one half of the ERR@20 scores for the respective Category
B runs. In 2012, however, the ERR@20 score of the anchor text run was almost 4 times
lower than that of the Category B run. This may partially explain the fact that combining
anchor text runs and Category B runs did not lead to a noticeable improvement in
performance.
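
The ratios quoted above follow directly from the ERR@20 values in Table I. The short sketch below only redoes this arithmetic; the function name err_ratios is introduced for illustration.

```python
# ERR@20 values from Table I: year -> (anchor text run, Category B run)
ERR20 = {
    2010: (0.056, 0.106),
    2011: (0.084, 0.137),
    2012: (0.079, 0.307),
}


def err_ratios(table):
    """Return year -> (anchor / Category B, Category B / anchor)."""
    return {
        year: (anchor / cat_b, cat_b / anchor)
        for year, (anchor, cat_b) in table.items()
    }


for year, (fraction, factor) in err_ratios(ERR20).items():
    print(f"{year}: anchor/B = {fraction:.2f}, B/anchor = {factor:.2f}")
```

The anchor text runs reach roughly 0.53 and 0.61 of the Category B score in 2010 and 2011, but only about 0.26 in 2012, that is, a gap of nearly a factor of 4.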

Finally, we evaluated the effect of not using SpamRank in 2010, 2011, and 2012.
To do this, we set the SpamRank factor to 1 (it is included multiplicatively). We
found that 2010 was the only year in which SpamRank improved the ERR@20 scores
of our method. This is in contrast with our 2010 observation that SpamRank can
improve performance scores by a large margin [Boytsov and Belova 2010]. Perhaps
a more advanced system, which, among other factors, includes anchor text, is more
robust to spam. It may also indicate that embedding a good relevance feature
into an already strong baseline does not necessarily lead to a performance boost
[Armstrong et al. 2009].
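
The SpamRank ablation can be pictured with a toy sketch. It assumes only what is stated above, namely that the spam factor enters the score multiplicatively; the tuple layout, the function name rank_documents, and all numbers are hypothetical.

```python
def rank_documents(docs, use_spam_rank=True):
    """Rank (doc_id, relevance, spam_factor) tuples by final score.

    The spam factor multiplies the relevance score. Disabling SpamRank
    is modeled by replacing the factor with 1.
    """
    def score(doc):
        _, relevance, spam_factor = doc
        return relevance * (spam_factor if use_spam_rank else 1.0)

    return [doc[0] for doc in sorted(docs, key=score, reverse=True)]


# Toy documents (hypothetical): d1 is highly "relevant" but likely spam
docs = [("d1", 3.0, 0.2), ("d2", 2.0, 0.9), ("d3", 1.0, 1.0)]
print(rank_documents(docs, use_spam_rank=True))
print(rank_documents(docs, use_spam_rank=False))
```

Setting the factor to 1 leaves the rest of the ranking formula unchanged, so any difference in ERR@20 between the two settings can be attributed to SpamRank alone.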

3.  CONCLUSIONS 

We  merged  results  obtained  from  the  Category  B  index  with  results  obtained  from 
the  index  built  over  complete  (Category  A)  anchor  text.  Yet,  this  approach  did 
not  lead  to  a  significant  improvement  in  performance.  We  hypothesize  that  simple 
merging approaches (such as linear combinations or round-robin) do not work well
if one of the systems has a much lower performance than the other.

Table I: Comparing performance of runs based on Category A anchor text against
performance of Category B runs (for different years). Scores are computed using ERR@20.

year              2010     2011     2012
anchor text       0.056    0.084    0.079
Category B run    0.106    0.137    0.307